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ABSTRACT 

A  Bayesian  procedure  is  described  for  smoothly  estimating  unknown  func¬ 
tions,  given  a  finite  set  of  observations.  It  is  assumed  that  a  suitable 
transformation  of  the  function  can  be  taken  to  possess  a  Gaussian  prior  dis¬ 
tribution  across  function  space.  The  five  special  cases  (a)  estimation  of  a 
logistic  density  transform,  <b)  the  log  intensity  function  of  a  non- 
homogeneous  Poisson  process,  (c)  the  log  hazard  function  for  survival 
data,  (d)  the  logit  function  in  biossay,  and  (e)  the  mean  value  function 
in  a  possibly  non-linear  time  series  of  the  Kalman  type  or  equivalently  a 
regression  function  for  possibly  non-normal  observations,  are  considered,  and 
in  each  case  a  non-linear  Fredholm  equation  is  described  for  the  posterior 
estimate.  In  cases  (d)  and  (e)  this  reduces  to  a  finite  dimensional  system. 

In  all  five  cases  an  approximate  procedure  is  developed  which  is  particularly 
useful  when  the  sample  size  is  large.  This  approximates  the  function  space 
prior  by  a  multivariate  normal  prior  on  the  coefficients  in  a  linear  approxi¬ 
mation,  and  then  proceeds  by  conventional  Bayesian  techniques.  In  cases  where 
the  prior  covariance  kernel  is  assumed  to  posses  a  particular  parametrized 
form  (e.g.  from  the  Orstein-Ulenbeck  process)  the  approximations  enable  us  to 
estimate  the  prior  parameters  appearing  in  the  kernel,  empirically  from  the 
data,  via  a  lemma  relating  to  the  EM  algorithm.  Finally,  in  a  very  special 
case,  involving  normal  observations  and  an  integrated  wiener  prior,  some  fresh 
Bayesian  and  empirical  Bayesian  results  are  developed. 

AMS(MOS)  Subject  Classifications:  62G05,  62M10,  62H12 

Key  Words:  Empirical  Bayes;  function  space;  density  estimation; 

non-homogeneous  Poisson  process;  survival  data;  hazard  function; 
biossay;  logit;  time  series;  non-linear  filtering;  Kalman; 
Gaussian  process;  Orstein-Uhlenbeck;  covariance  kernel; 

EM  algorithm. 
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SIGNIFICANCE  AND  EXPLANATION 


Prior  function  space /non- linear  Fredholm  equation  procedures  are  con¬ 
sidered  for  the  smooth  estimation  of  unknown  functions  given  finite  sets  of 
observations.  Situations  considered  include  (a)  density  estimation,  (b) 
estimation  of  the  intensity  function  for  a  non-homogeneous  Poisson  process, 

(c)  the  hazard  function  for  survival  data,  (d)  the  logit  function  in 
biossary,  and  (e)  the  regression  function  or  time-varying  mean  value 
function  for  possibly  non-normal  observations.  The  problem  may  be  approxi¬ 
mately  reduced  to  a  Bayesian  analysis  in  finite-dimensional  space  thus  leading 
to  simple  approximate  solutions  to  the  Fredholm  equations.  This  facilitates 
the  empirical  estimation  of  the  smoothing  and  shrinkage  parameters  appearing 
in  the  prior  covariance  kernel,  via  a  lemma  relating  to  the  EM  algorithm. 
Finally,  in  a  well-known  special  case,  the  estimation  of  a  normal  mean  value 
function  under  an  integrated  Wiener  prior  is  considered  and  some  fresh  pro¬ 
cedures  are  described  for  making  inferences  about  the  smoothing  parameter  X. 
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AN  EMPIRICAL  BAYESIAN  APPROACH  TO  THE 
SMOOTH  ESTIMATION  OF  UNKNOWN  FUNCTIONS 

Tom  Leonard 


1 .  INTRODUCTION . 

We  consider  situations  where  the  likelihood,  given  the  observation  vector 
y,  acts  as  an  integral  operator  upon  an  unknown  function  X ( t ) ,  for  t  e 
(a,b).  In  many  such  situations  it  is  possible  to  find  a  mapping  5  such  that 

X  =  5(g) 

where  g(*):(a,b)  ♦  R^  is  not  subject  to  any  functional  or  inequality  con¬ 
straints.  The  problem  considered  in  this  paper  is  the  smooth  estimation  of 
g  and  hence  X  ,  given  the  observation  vector  y. 

Our  approach  will,  for  example,  be  applicable  to  the  following  situa¬ 
tions: 

T 

(a)  Let  y  =  where  y^,***,y  constitute  a  random  sample  from 

a  distribution  with  unknown  density  X(t)  for  t  e  (a,b).  Then  (see  Leonard, 
1978)  it  is  often  useful  to  work  in  terms  of  a  logistic  density  transform  g 
satisfying 

X (t)  =  eg(t)/Jb  e*(s)ds  (1.2) 

a 

which  avoids  the  constraints  that  X  is  non-negative  and  integrates  to 
unity.  In  this  case  the  log-likelihood  functional  of  g,  given  y  is 
denoted  by 

L(g)  =  1  g(y.)  ~  n  log  fb  eg*S^ds. 

i=1  1  a 

(b)  Let  y^,***,y  denote  the  arrivals  during  a  fixed  interval  (a,b)  for  a 
time-varying  Poisson  process  with  unknown  rate  function  X(t)  for  t  e  (a,b). 
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Here  n  is  itself  a  random  variable  and,  for  n  >  1,  the  log-likelihood 
of  g  -  log  A,  given  g  and  n,  is 

L(g)  ■  l  g(y. )  -  /b  e9<s*ds.  (1.3) 

i-1  a  T 

(c)  Let  y  *  ^1 '*** '^r,yr+1 ' *** '^n'  '  r  is  a  random  variable,  with 

y^+1 , • • * zy^denoting  fixed  censoring  times,  and  suppose  we  are  concerned  with 

the  estimation  of  the  survivor  density  A(t)  for  t  9  (0,*»),  from  which  the 
uncensored  observations  y1 , • • • ,yf  are  a  random  sample.  Let 
h(t)  =  A(t)/( 1  -  A<t)) 

with 

A(t)  *  ft  A(s)ds 
0 

denote  the  hazard  function.  Then  the  log-likelihood  of  the  log-hazard 

function  g(t)  =  log  A(t)  is  given  by 

L(g)  *  X  g(yt)  -  X  /  1  e9*8*ds.  (1.4) 

i-1  1  i-1  0 

(d)  Consider  the  classical  biossay  problem  where  n  rats  receive  dose  levels 
x1,»»*,xn  and  we  wish  to  estimate  the  function  A(t)  for  t  e  (0,°»)  which 

represents  the  probability  that  a  rat  will  die  if  it  receives  dose  level  t. 


Let 


*i 


1  if  rat  with  dose  level  i 
0  otherwise.  (i  > 

Then  the  log-likelihood  of  the  logit  function 
g(t)  -  log  0(t)  -  log[1  -  0 ( t ) 1 


dies 
1 » • • • ,n) 


is  given  by 

n  n  g(x.) 

L(g)  *  X  y1g(x1 )  -  X  i°gO  +  e  }.  (i.5) 

i-1  x  i-1 

(e)  Suppose  that  we  observe  n  independent  observations  y, ,***,y  relating 

i  n 

to  an  unconstrained  function  g(t)  for  t  6  (a,b)  such  that  the  density  or 


probability  mass  function  b (y^ >g ( t^ ) )  of  y^^  given  g,  depends  upon 
g(t^)  but  not  further  upon  g. 
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For  example,  t  might  denote  a  time  variable  and  the  y^  might  be 
independent  and  Poisson  distributed  with  mean  equal  to  expfgtt^)}.  When  we 
add  a  further  distribution  to  g  we  see  that  this  will  yield  a  time  series 
model  for  Poisson  observations,  with  possibly  unequal  time  points.  Alter¬ 
natively  t1 , * • • , t^  might  correspond  to  explanatory  variables  so  that  we  are 


in  a  non-linear  regression  situation.  In  any  case 

n 

L(g)  ■  I  log  b(y.;g(t. )) 
i=1 

for  which  (1.5)  is  a  special  case  in  the  binary  situation. 


(1.6) 


In  all  five  cases  it  is  necessary  to  assume  some  prior  information  in 
order  to  estimate  the  whole  curve  g,  given  only  a  finite  set  of  observa¬ 
tions.  A  wide  range  of  prior  distributions  are  contained  in  the  Gaussian 
family.  In  each  case  it  might  be  reasonable  to  take  {g(t);t  e  (a,b)}  to 
assume  the  probability  structure  of  a  Gaussian  process  with  mean  value 


function  u(t),  and  covariance  kernel 

cov ( g ( s ) , g ( t ) )  =  K(s,t)  (s  6  (a,b);  t  8  (a,b)). 
For  example,  the  choice 


(1.7) 


cov(g<1 } , (s) ,  g(1,(t))  =  o2e"0'S"tl  (0  <  02  <  »;0  <  8  <  -)  (1.8) 

of  Orstein-Uhlenbeck  kernel  yielded  sensible  results  in  cases  (a)  and  (b),  as 
discussed  by  Leonard  (1978),  a  paper  which  we  will  refer  to  as  (L).  The 
latter  is  a  continuous  version  of  the  method  for  histograms  discussed  by 
Leonard  (1973). 

2 

Note  that  the  prior  parameter  o  in  (1.8)  measures  the  closeness  of 

g*1^  to  its  prior  estimate  and  8  measures  the  smoothness  of  g*1* 

and  g.  These  may  be  referred  to  as  the  "shrinkage"  and  "smoothing" 
parameters;  we  will  later  describe  how  they  may  be  estimated  empirically  from 
the  data. 
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The  prior  mean  value  function  u  may  be  taken  to  represent  the 
statistician's  null  hypothesis  for  ti.  For  example,  in  the  biossay  situation 
(d)  he  might  take 

M  (t)  *  o  +  6t  for  t  6  (0,*) 

representing  a  hypothesized  logistic  linear  model*  If  any  unknown  parameters 
appear  in  the  null  hypothesis  then  they  could  be  estimated  by  standard  para¬ 
metric  techniques,  under  the  hypothesized  assumption  that  u  represents  the 
true  model. 

For  the  Poisson  process  (b)  and  the  time  series  example  in  (c)  the  extra, 
Gaussian,  stage,  to  the  distributional  assumptions,  could  be  taken  as 
representing  further  time  dependence  in  the  sampling  model.  For  example,  in 
(b)  the  two  stages  of  the  distributional  assumptions  together  give  a  broad 
representation  of  doubly  stochastic  Poisson  processes  (see  Snyder  (1975)) 
whilst  in  (e)  we  obtain  a  large  class  of  possibly  non-linear  hierarchical  time 
series  models  of  the  Kalman  (1961)  type. 

We  will  propose  two  possible  methods  of  estimation  for  g(»).  The  first 

gives  a  "posterior  maximum  likelihood  estimate"  of  g  as  the  solution  of  a 

typically  non-linear  Fredholm  integral  equation.  The  second  method  will  just 

be  described  under  the  particular  covariance  structure  in  (1.8).  It  promises 

to  be  useful  operationally  since  it  also  permits  the  empirical  estimation  of 

2 

the  prior  covariance  parameters,  o  and  6.  This  involves  approximating 
g  by  a  linear  combination  of  known  basis  functions,,  projecting  our  function 
space  prior  onto  an  appropriate  prior  for  the  unknown  parameters  in  the  linear 
combination,  and  completing  a  Bayesian/empirical  Bayesian  analysis  in  finite 
dimensional  space.  The  first  procedure  is  summarized  in  the  next  section. 
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2.  POSTERIOR  ESTIMATION  OF  g. 

By  an  extension  to  a  result  described  on  page  141  of  the  discussion  of 

T  it  is  possible  to  estimate  g  by  the  solution  g  to  the  integral  equation 

g(t)  -  M(t)  +  /b  K(s,t)n(g,s)ds  t  6  (a,b)  (2.1) 

a 

where  the  function  {n(g,t)j  t  6  (a,b)}  is  chosen  to  ensure  that 


3L(g  +  eu) 
3e 


e=0 


rb  v 

J  n(g,s)u(s)ds 


(2.2) 


is  zero  for  all  integrable  functions  (u(t);  t  @  (a,b)}. 

>» 

In  our  five  special  cases  this  yields  the  following  choices  for  n(g,t): 


(a) 

(b) 

(c) 

(d) 
and 

(e) 
where 


$  (t)  -  n  exp{g(t)}//  eg^s*ds 


♦  n(t)  -  exp{g(t)} 

*rU)  "  J,  Xtt  <  yj 


(t)exp{g(t)} 


n 

l  (y.  -  exp{g(x. )}/(1  +  exp{g(x. })]6  (t) 

i=1  X  1  i 


n  3  log  bly  ,-g(t  )) 

J,  — —  \,t> 


V"  -  l  V'*’ 

1=1  1 


(2.3) 

(2.4) 

(2.5) 

(2.6) 

(2.7) 

(2.8) 


with  6  (t)  denoting  the  Dirac-delta  function  at  t  *  y^  and  I[t<y 

yi  yi 1 

the  indicator  function  for  [t  <  y^]. 

In  the  biossay  example  (d)  we  obtain  from  (2.1)  the  equation 


„  n  g(x. )  g(x  ) 

g(t)  =  w ( t)  +  l  K(x  » t ) [y .  -  e  /(1+e  )] 

i=1 


(2.9) 


for  t  €  (afb). 

In  this  case,  and  similarly  in  (e),  the  n+1  quantities  g(t),g(x^), 

•••,g(x  )  should  be  interpreted  as  the  joint  posterior  modes  of  g(t), 
n 

g(x- ),•••, g(x  ).  This  is  because  equation  (2.9)  may  be  easily  obtained  by 
•  n 

multiplying  the  likelihood  functional  exponentiating  (1.5)  by  the  (n+1  )- 
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dimensional  multivariate  normal  prior  distribution  of  these  quantities,  and 
then  maximizing  the  consequent  posterior  distribution. 

Equation  (2.9)  may  be  solved  computationally  by  firstly  solving  the  n- 
dimensional  non-linear  system  obtained  by  replacing  t  by  each  x^  in 
turn.  Newton-Raphson  will  be  adequate  for  solving  this  system  for  g(x^ ), 
***,g(xn).  Then,  in  terms  of  these  n  quantities,  (2.9)  provides  an  ex¬ 
plicit  interpolation/extrapolation  formula  for  g(t),  so  that  the  complete 
continuum  may  be  immediately  estimated. 

■v 

Note  that  the  solution  g(t)  to  (2.9)  will,  for  a  smooth  covariance 

kernel,  possess  similarly  smooth  regularity  properties.  For  example,  under 

the  choice  in  (1.8)  the  equations  become 

,  „ 

g(t)  -  |i(t)  +—  l  (expt-flU-x^)  +  20(t-xi)I  <t.,(t))[yi  -  - — - ] 

0  i*1  1  i  g(x.  ) 

1+e 

+  further  terms  not  involving  t.  (2.10) 

Upon  differentiating  the  expression  on  the  right  hand  side  it  is  easily 
checked  that  g  possesses  a  continuous  second  derivative.  This  is  preferable 
to  choosing  the  covariance  kernel  in  (1.8)  for  g  rather  than  1 ^  since 
this  leads  to  an  estimate  with  a  discontinuous  first  derivative. 

Under  very  wide  regularity  conditions  it  is  easy  to  show  that  the  solu- 

V  V 

tions  for  g(x, ),•••, g(x  )  in  (2.9)  become  strongly  consistent  as  n  ♦  •  for 
i  n 

the  corresponding  true  values  g(x1 )»••*, gfx^).  We  simply  require  the 

“1  r 

almost  sure  convergence  of  m  I  K(x^,t)y^  to  its  sampling  expectation  i.e. 


If  this  is  true  then  the  second  term  on  the  right  hand  side  of  (2.9)  will 
become  arbitrarily  large,  and  the  expression  in  (2.11)  will  therefore  become 
equated  with  a  similar  expression  but  with  the  g(x^)  replaced  by  the  corres- 
ponding  g(x^).  Hence  a  consistent  set  of  solutions  will  exist. 

It  is  necessary  to  be  more  careful  when  interpreting  our  estimates  in 
cases  (a),  (b),  and  (c),  since  the  integral  equations  can  no  longer  be  reduced 
to  an  n+1  dimensional  finite  dimensional  system.  The  best  interpretation 
follows  from  a  suggestion  by  Whittle  on  page  136  of  the  discussion  of  (L).  He 
describes  how  our  integral  equations  may  be  derived  as  the  limit  of  a  high¬ 
dimensional  non-linear  system  obtained  by  discretization  of  the  interval 
(a,b),  and  working  with  the  high  dimensional  multivariate  normal  prior  distri¬ 
bution  discretizing  our  Gaussian  prior  on  function  space.  In  the  discretized 
case  the  solutions  of  the  non-linear  system  are  the  modes  maximizing  the 
posterior  density.  In  the  function  space  limit  it  is  no  longer  meaningful  to 
think  in  terms  of  modes  since  the  posterior  Radon-Nikodym  derivative  depends 
upon  a  choice  of  dominating  measure.  So  the  estimates  may  be  referred  to  as 
"limiting  posterior  modes". 

The  integral  equations  in  these  three  cases  can  alternatively  be  derived, 
for  a  fairly  wide  range  of  covariance  kernels,  via  a  direct  optimization 
scheme  on  function  space,  involving  prior  and  posterior  "likelihoods"  -  see 
pages  117-8  of  (L).  Hence  our  estimates  can  also  be  interpreted  as  posterior 
maximum  likelihood  estimates. 

Note  that  the  regularity  and  consistency  properties  described  above  for 
case  (d)  also  hold  in  the  other  five  cases;  in  the  first  three  cases  the  whole 
continuum  of  g  will  be  consistent  for  g. 
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3.  SOME  APPROXIMATE  METHODS 


In  cases  (a),  (b),  and  (c)  it  is  particularly  tedious  to  solve  the  non¬ 
linear  equations  iteratively,  and  more  difficult  to  find  a  procedure  esti¬ 
mating  any  hyperparameters  appearing  in  the  covariance  kernel.  We  therefore 
demonstrate  how  to  obtain  approximate  solutions,  under  the  special  assumption 
in  (1.8),  but  for  all  five  cases  (a),  (b),  (c),  (d),  and  (e).  These  methods 
are  particularly  useful  when  n  is  large. 

In  (L)  I  show  that,  under  this  Orstein-Uhlenbeck  covariance  structure, 
the  solution  for  g  to  (2.1)  is  a  maximizer  of  the  functional 

L.,(g)  *  L(g)  +  LQ(g)  (3.1) 

where 

-2L  (g)  /b{g(2)(t>  -  uC2)(t)}2dt 

°  a 

+  ~  fb  { g(1)(t)  -  u(1)(t)}2dt 
o  a 

+  -V  {g(1)(a)  -  w(1)(a)}2  +  ^  {g(1 ’(b)  -  u(1)(b)}2  (3.2) 

a2  or2 

is  obtained  via  a  Radon-Nikodym  derivative  for  the  prior  distribution.  Note 
that  L1 (g)  in  (3.1)  plays  the  role  of  a  penalized  log-likelihood,  since 
-2LQ(g)  may  be  interpreted  as  a  roughness  penalty.  This  term  penalizes  high 
first  and  second  derivatives  for  the  estimated  g  function. 

Suppose  now  that  we  approximate  g  by  the  linear  functional 

g*(t)  =  XT$(t)for  t  6  (a,b)  (3.3) 

where  1  is  a  p  x  1  vector  of  unknown  parameters,  and  $(t)  is  a  p  x  i  vector 
of  known  functions.  For  example,  $  could  consist  of  p  orthogonal 
polynomials;  alternately  $  could  contain  the  basis  of  B-splines  for  which 
g*  is  a  general  cardinal  spline  with  p  knots.  Note  that  this  choice  need 
not  in  practice  be  related  to  regularity  conditions  on  the  "true"  function; 
since  the  true  function  is  hypothetical  enough  for  us  to  be  able  to  fix  our 


own  regularity  conditions 


1 


Under  the  linear  constraints  in  (3.3),  maximization  of  L., (g)  with 
respect  to  g  is  equivalent  to  maximization  of  L^x)  with  respect  to  X' 


where 


Vi>  =  L(*}  +  V** 


(3.4) 


where 


-2L0(x)  =  (X  -  u)  R(X  -  Id) 


(3.5) 


with 


R  =  — A_  +  — -  A  +  — -  A 

'  (So2  '2  a2  ^  o2  '° 


(3.6) 


and 


r  u  =  — d  +  — -  d  +  -1—  d 

'  “  SO2  "2  a2  ''  o2 


(3.7) 


where 


A  =  /  $(2)(t)$(2)T(t)dt 


/  $(1>(t)$(1 )T(t)dt 


0>,.lA<1)T 


M)/w*a(DT# 


Aq  =  $  (a)^  (a)  +  t  <b)t  (b) 


d2  =  /  w(2)(t)*(2)(t)dt 


d,  =  /  u'U<t)$U'<t)dt 

dQ  =  p(1)(a)$(1U)  +  i i  (1|b)$  °|b). 


The  function  in  (3.4)  will  be  minimized  when  X  =  X  »  where 
3L(X) 

_____  =  r(X  _ 

In  the  five  cases  introduced  in  the  last  section,  the  derivative  on  the 


(3.8) 


left  hand  side  of  (3.8)  is  given  by 

ft  ,  rb 

(a)  t  -  n  /  $(t)exp(x  $(t)}dt/j  e  dt 


(3.9) 


where  t  =  £  $(x. ) 

i=1  1 
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(3.10) 


(b) 


(c) 


t  -  Jb  ^(t)exp{^T^(t)}dt 
a 


n 

where  £  =  £  ^(x.  ) 

i=1  1 


n  y.  >_ 

t  -  l  /  ^(t)exp(x  fc(t)}dt 
i=1  a 

r 

where  t  -  l  $(x  ) 
i*1 


(3.11  ) 


(d) 


xT 

ft<v 


? 


1+e 


(X.) 


(3.12) 


n 

where  t  =  J  yi^xi* 
i=1  1 

and 

n  a 

(e)  l  lit  )  j-  b(y  .,u)|  .  (3.13) 

i*1  U  |u  =  f^t) 

Note  that,  in  the  first  four  cases,  there  are  p-dimensional  sufficient 
statistics  for  given  the  approximations.  In  the  spline  case,  t  will 

consist  of  sample  B-splines,  and  in  the  polynomial  case  it  will  contain  sample 
moments.  Therefore  the  choice  of  basis  functions  should  perhaps  depend  upon 
which  statistics  are  thought  to  be  most  relevant. 

Equation  (3.8)  may  in  general  be  solved  by  Newton-Raphson,  combined  with 
an  integration  routine  for  cases  (a),  (b),  and  (c).  The  convergence  will  be 
faster  than  ordinary  maximum  likelihood  for  owing  to  the  normalizing 
effect  on  the  Hessian  of  the  R  matrix  on  the  right  hand  side.  Then  the 

»  vr 

solution  g  to  (2.1)  may  be  approximated  by  1  $(t).  A  spline  basis  gives 
excellent  error  terms;  p  »  10  or  15  should  be  adequate;  the  choice  of  p 
can  be  juggled  to  check  accuracy  of  the  computations.  This  seems  to  give  a 
computationally  useful  way  of  approximating  the  solution  to  our  non-linear 
integral  equations. 
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4.  APPROXIMATE  POSTERIOR  PROBABILITIES 

Owing  to  the  simplicity  of  the  quadratic  form  in  (3.5)  our  approximate 
estimation  procedure  is  equivalent  to  an  exact  Bayesian  analysis  for  X  under 
the  specified  log  likelihood  LCJ),  and  the  multivariate  normal  prior 

X  "  N(^,R-1).  (4.1) 

We  therefore  propose  the  mean  vector  and  precision  matrix  in  (3.6)  and 
(3.7)  as  possessing  reasonable  structures  for  meaningfully  representing  prior 
information  about  the  coefficients  in  a  linear  approximation.  We  are  unfa¬ 
miliar  with  other  candidates  for  this,  even  in  the  straightforward  polynomial 
regression  situation. 

Under  this  finite  dimensional  Bayesian  analysis  the  exact  posterior 
density  of  X  is 

x(^|y)  «  exp{L(T£)  -  V2  <X  ~  ~  ld>}  (4.2) 

N. 

and  X  satisfying  (3.8)  is  the  exact  posterior  mode  vector.  The  mode  vector 
y  is  well-known  to  be  a  first  approximation  to  the  posterior  mean  vector; 
refinements  may  if  necessary  be  obtained  via  an  Edgeworth  expansion  to  the 
posterior  density.  Similarly  the  posterior  covariance  matrix  of  X  may  be 
approximated  by  the  exact  posterior  dispersion  matrix  D  (the  inverse  of  the 


Hessian  in  the  Newton-Raphson  iterations  for  solving  (3.8)),  where 

_1  -32logw (Xlx>  -32b(^) 

5  ..  ip  ^  v * 

3<n  )  3<u> 


(4.3) 


This  gives  the  following  approximation  to  the  posterior  distribution. 


which  could,  if  required,  be  replaced  by  exact  expansions: 
lIX  '  N<X,D). 


(4.4) 
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Hence  the  posterior  distribution  of  the  g  function  is  approximately 
Gaussian  with  mean  value  function 

g(t)  -  tfyt)  (4.5) 

and  covariance  kernel 

cov(g(s ) ,g(t) )  =$T(s)D$(t)  (4.6) 

yielding  the  possibility  of  very  general  posterior  probability  statements 
about  g. 
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5.  THE  ESTIMATION  OF  02  and  6 

2 

It  is  possible  to  estimate  a  and  0  by  approximating  the  values 
maximizing  their  "marginal  log-likelihood" 

log  p(y|02,0)  =  log  /  p(^|^)ir(j|<J2,&)da2d0.  (5.1) 

Note  that,  using  similar  notation,  it  is  possible  to  calculate  (5.1) 
approximately,  using 

log  p(y|o2,0)  =  log  p(y)x>  +  log  w (-^ | a2 , 0 )  -  log  "(xly^.B) 

“  KI)  -V2log|D|  -1/2log|R|  +  V2  ( t  -  -  !£>•  (5.2) 

This  approximation  is  based  upon  a  backwards  application  of  Bayes  theorem 
together  with  the  normal  approximation  to  the  posterior  density  of  y  . 

Hence  it  is  possible  to  calculate  an  approximation  to  the  whole  marginal  log- 
likelihood  of  o2  and  0. 

2 

In  order  to  obtain  point  estimates  for  o  and  0  we  refer  to  the 
following  lemma,  which  could  be  viewed  as  a  special  case  of  the  Bi  algorithm; 
see  Dempster,  Laird,  and  Rubin  (1977). 

Lemma;  Suppose  that  the  sampling  distribution  of  a  vector  of  observations  y 

depends  upon  p  unknown  parameters  Y, ,••• ,1  and  that  the  prior  distribution 

l  p 

of  depends  upon  r  <  p  prior  parameters  01,»»»,0r.  Suppose 

further  that  with  q  >  r  the  prior  density  of  belongs  to  the 

q-parameter  exponential  family 

*<Xl(i)  =  exp{  \  v  (£)t.(j)  +  a(£)}  (5.3) 

j-1  3  3 

where  a  and  the  v^  do  not  depend  upon  \  and  the  t^  do  not  depend  upon 
(J.  Then,  if  the  marginal  likelihood  of  £  is  differentiable,  it  is  maxi¬ 
mized  by  values  satisfying  the  equation 
q  3v  (£) 

l  “gj-—  [E(t  <X)j£]  -  E[t  <X>|£,y)]  -  0  (k  -  1, •••,?).  (5.3) 

j-1  k  J  3 

The  proof  of  this  lemma  is  straightforward  upon  differentiating 
log  p(y||)  -  log  /  p(ylx)*(x|£)dx 


with  respect  to  £ 


To  apply  the  lemma  note  from  (3.2)  and  (3.3)  that  the  prior  may  be 
expressed  in  the  required  form  with  r*2,  y=3,  ^  =  o~2,  B2  *  B,  v1  * 

^1^2'  V2  =  an<*  V3  ”  ®i*  A^ter  some  manipulations  the  equations 

reduce,  under  our  approximation  to  the  posterior  distribution,  to 

®2  =  P_1(X  "  -  U>  +  P_1  trace(Rg)  (5.3) 

and 
B2  - 

<X  -  (L2)T  *2(£  ~  >*2*  +  trace^2e>  ~  -  U2)T^2(H  -  ^2)T  ~  trace (^1^2) 

(X  -  -!!,>♦  trace  (^(J)  -  (fc  -  )*,)%((*  -  H ,  >  -  trace  (fc"1^) 

where 

^1  “  ^1  d1 

and 

^2  *  -2  d2 

with  the  A's  and  £'s  defined  in  (3.7).  Equations  (5.3)  and  (5.4)  may  be 
solved  by  cyclic  substitutions  in  conjection  with  (3.8)  and  (4.3). 
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6.  FILTERING  A  NORMAL  MEAN  VALUE  FUNCTION  OR  REGRESSION  FUNCTION. 

We  now  consider  a  very  special  case  of  the  optimization  problem  in  (3.1) 

and  (3.2)  which  will  be  relevant  when  the  observations  y  are  inde- 

1  in 

pendent  and  normally  distributed,  given  their  respective  means  g(t1 ),•••, 

2 

g(tm)  and  common  variance  t  .  Consider  the  optimization  with  respect  to 
g  of 

I*!  (9)  «  T-2  l  (Vi  -  g(t.)2)  +  r1  /b[g(2)(t))2dt.  (6.1) 

i=1  a 

The  first  term  on  the  right  hand  side  relates  to  the  sampling  log-likeli¬ 
hood  under  our  normal  theory  assumptions.  The  second  term  may  be  obtained  by 
2  2 

letting  o  *  •  and  8o  ♦  $  in  our  Orstein-Uhlenbeck  process  (3.2),  and 

represents  an  integrated  Wiener  process. 

The  optimizer  g  of  (6.1)  is  well-known  (e.g.  Wahba,  1975,76)  to  be  a 

2  -1 

cubic  spline  with  n  knots.  We  now  consider  the  estimation  of  T  and  4 
and  consider  linear  combinations  of  the  form  described  in  (3.3)  where  ^  and 
^(t)  are  p-dimensional.  Note  that  when  p  «  n  this  assumption  is  exact 
under  the  appropriate  choices  of  basis  functions  for  £(t)  (which  leads  to  a 
general  B-spline  for  g  with  n  knots  and  no  constraints  on  the  coef¬ 
ficients).  When  p  is  fixed  to  be  around  10  or  15  we  have  approximations 
which  will  be  particularly  useful  when  n  is  large.  However,  the  theory 
described  below  relates  to  both  of  these  exact  and  approximate  situations. 
Under  the  linear  assumption  in  (3.3),  (6.1)  may  be  replaced  by 

Vx>  =  T~2(x  "  s,Ts<x  -  &>  +  y_1xt£2x  (6.3) 

where  A2  is  given  after  (3.7), 

*  =  e-1(j  (6.4) 

m  T 

i  -  I  t<vt  <v  <6*5> 

i*1 

and 
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(6.6) 


m 

fc  =  I  y^M. 

i=1 

This  reduces  the  optimization  problem  to  the  classical  Bayesian  problem 

2  -i 

of  estimating  a  regression  vector  X  under  a  multivariate  normal  N(0,o  A _  ) 
prior,  giving  the  standard  result 

Xly  '  N(X,D)  (6.7) 

where 

X  =  t"2<t“2£  +  ^“1A2)"1h  (6.8) 

-  +  **2>_1fc 

with 

A  =  t2/$  (6.9) 


and 


*  x_2(g  +  AA2).  (6.10) 

2 

The  parameters  A  and  T  may  now  be  estimated  by  any  one  of  the  following 
three  procedures 

( a )  Application  of  the  EM  algorithm. 

By  a  direct  application  of  the  EM  algorithm  described  by  Dempster  et.  al. 

2 

it  is  possible  to  show  that  the  marginal  maximum  likelihood  estimates  of  T 
and  $  satisfy  the  equations 

sn 

t2  «  m_1  l  (y  -  jT$(t. ))2  +  m-1trace(D£)  (6.11) 

i-1  1  1 


and 


♦  -  P-1XT$2X  +  p"1  trace (g) .  (6.12) 

Equations  (6.11)  and  (6.12)  may  be  solved  by  cyclic  substitutions.  Eval- 

2  2 

uate  (6.8)  and  (6.10)  for  trial  values  of  T  and  $  ,  substitute  for  X  and 

2 

D  on  the  right  hand  sides  of  (6.11)  and  (6.12),  obtain  new  values  for  T 
and  4  on  the  left  hand  side  and  repeat  the  procedure  until  convergence.  The 
latter  is  guaranteed  under  general  results  governing  the  algorithm. 
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2 

Setting  X  *  T  /$  the  above  procedure  will  achieve  the  maximum  of  the 
marginal  likelihood  obtained  by  noting  that  the  marginal  distribution  of  l 

9 

2 

given  t  ,  and  X  is  multivariate  normal  with  zero  mean  vector  and  covari- 
2  -1  -1  -1 

ance  matrix  T  (£  +  X  A2  The  marginal  likelihood  is 

A (T 2 ,X | y )  *  p(y|i2,X) 

«  (T2)-1/2nX1/2p  exp{-V2T-2UA2(A2  +  (6.13) 

for  which  more  standard  optimization  procedures  might  prove  tedious. 

( b )  Bayesian  Methods 

2 

Suppose  that  vu/T  possesses  a  prior  distribution  which  is  chi-squared 

with  v  degrees  of  freedom  and  which  is  independent  of  X.  Then  w  1  is  the 
_2 

prior  mean  of  t  and  v  is  the  prior  'sample  size'.  In  ignorance  situa¬ 
tions  a  small  but  non-zero  value  e.g.  v  =  i  should  be  chosen  for  v.  For 
general  v  we  find  from  (6.13)  that  the  marginal  posterior  density  of  X  is 
ir(X|y)  «  x(X)X  1/2P[vu  +  X£A2($2  +  Xg)-1p£|]“  (V+n)  (0  <  X  <  »)  (6.14) 

where  *(X)  may  be  set  proportional  to  unity  in  ignorance  situations,  with 
any  proper  prior  distribution  permissible  in  the  presence  of  prior  knowledge 
about  X.  Note  that  Bayes  estimates  for  X  may  be  easily  obtained  from 

(6.14).  For  example,  the  posterior  mean  may  be  obtained  via  the  obvious  one- 

-2 

dimensional  integration.  Furthermore  the  posterior  mean  of  t  is 

E(t’2|y)  -  /  E(T“2|X,y)w(X|y)dX  (6.15) 

where 

E(T"2|X,y)  -  (v  +  n)/[vw  +  Xi£2(A2  +  Xp)-1p£]  (6.16) 

and  the  unconditional  posterior  mean  vector  of  is 

E(*|y)  *  /  EC)(|X,^)ir(X|y)dX  (6.17) 

where  the  first  contribution  to  the  integrand  is  given  in  (6.8). 


These  results  may  be  generalised,  as  required,  to  give  the  unconditional 
covariance  matrix  and  posterior  distribution  of  \  and  the  unconditional  mean 
value,  covariance  kernel,  together  with  unconditional  posterior  probabilities, 
for  the  function  g(t)  *  ^%{t). 

( c )  Cross-Validation. 

Wahba  (1976)  and  others  estimate  X  by  an  empirical  cross-validation 
procedure  which  does  not  relate  to  the  likelihood  function  in  (6.13).  We 
imagine  that  her  procedure  will  prove  appealing  when  the  statistician  is  not 
clear  about  his  particular  choice  of  prior  functional  in  (6.1). 
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